Red Wine Quality Exploration by IMAM RAZI

This report explores a dataset containing the quality ratings and attributes of 1599 red wines. Our goal is to find out which attributes influence the quality of the wine.

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Wine data has 1599 observations and 13 variables.

Based on the plot, most of the red wines are of average quality. To better understand this, I created a new variable ‘quality.score’, which maps the numeric wine quality score to Poor, Average, Good or Excellent. A quality score of 4 or lower falls under “Poor”, a score of 5 under “Average”, 6 under “Good”, and scores above 6 under “Excellent”.

#Creating a new variable to represent wine quality as a category
winedata$quality.score <- ifelse(winedata$quality<=4,"Poor",
                                 ifelse(winedata$quality==5,
                                 "Average",
                                 ifelse(winedata$quality==6,"Good",
                                        "Excellent")))
#Update the order of factor
winedata$quality.score <- ordered(winedata$quality.score, 
                                  levels = c("Poor","Average","Good","Excellent"))
## 
##      Poor   Average      Good Excellent 
##        63       681       638       217

We can see that most of the wines fall under average and good quality. Excellent and poor wines are far fewer in number.
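The same binning can also be sketched with `cut()`, which avoids the nested `ifelse()` calls. This is only an illustrative alternative, not the code used for the report; the `quality` vector below is a hypothetical stand-in for `winedata$quality`.

```r
# Hypothetical stand-in for winedata$quality (real scores range from 3 to 8).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# With the default right = TRUE the bins are (-Inf, 4], (4, 5], (5, 6], (6, Inf],
# which matches the rules: <= 4 Poor, 5 Average, 6 Good, above 6 Excellent.
quality.score <- cut(quality,
                     breaks = c(-Inf, 4, 5, 6, Inf),
                     labels = c("Poor", "Average", "Good", "Excellent"),
                     ordered_result = TRUE)

# Count the wines in each category.
table(quality.score)
```

Because `ordered_result = TRUE` returns an ordered factor directly, no separate call to `ordered()` is needed in this variant.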

Alcohol

Let's look at a summary of the alcohol content of the wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The mean alcohol content of the wines is 10.42%. The minimum alcohol content in any wine is 8.40%.

The alcohol distribution is skewed right, so wines with a high alcohol content are few in number.

Fixed Acidity

Let's look at a summary of fixed acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity is positively skewed. The middle half of the wines (first to third quartile) have a fixed acidity between 7.10 g/dm3 and 9.20 g/dm3. The mean fixed acidity is 8.32 g/dm3.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity is skewed right. The mean volatile acidity is 0.53 g/dm3.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid appears to have a bimodal distribution without a log transformation. With a log transformation it is negatively skewed.
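One caveat when log-transforming citric acid: the summary shows a minimum of 0.000, and the logarithm of zero is not finite. The sketch below uses hypothetical values to show the effect. How the original plot handled the zeros is not stated, and the small shift shown here is only one possible workaround, not the report's method.

```r
# Hypothetical citric acid values; the zeros mirror the minimum of 0.000 in the summary.
citric <- c(0, 0, 0.04, 0.26, 0.42, 1.00)

# log10(0) is -Inf, so zero values cannot be placed on a log axis.
log_citric <- log10(citric)
n_dropped <- sum(!is.finite(log_citric))

# Assumed workaround: add a small constant before taking the log.
# The constant 0.01 is an arbitrary illustrative choice.
log_shifted <- log10(citric + 0.01)
```

Here `n_dropped` counts the observations that would be lost on a log scale, which is worth checking before interpreting the shape of the transformed histogram.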

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The middle half of the wines have residual sugar between 1.9 g/dm3 and 2.6 g/dm3. Without a log transformation, residual sugar has a right-skewed distribution. After a log transformation, it is close to normally distributed.
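The effect of a log transformation on a right-skewed variable can be sketched numerically. The example below does not use the wine data: it builds a hypothetical, evenly spaced lognormal sample and compares a simple moment-based skewness before and after `log10()`. The `skewness` helper is defined here for illustration and is not a base R function.

```r
# Moment-based sample skewness (helper defined for this sketch only).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical right-skewed values: lognormal quantiles at evenly spaced probabilities.
# The parameters are arbitrary and not fitted to residual sugar.
sugar <- qlnorm(ppoints(1000), meanlog = 0.8, sdlog = 0.5)

skew_raw <- skewness(sugar)         # clearly positive: long right tail
skew_log <- skewness(log10(sugar))  # near zero: roughly symmetric after the log
```

A skewness near zero after the transformation is consistent with an approximately normal shape, although it does not prove normality on its own.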

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides are approximately normally distributed after a log10 transformation. The middle half of the chloride values fall between 0.07 g/dm3 and 0.09 g/dm3.

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide has a right-skewed distribution. The middle half of the wines have free sulfur dioxide between 7 mg/dm3 and 21 mg/dm3.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

After a log transformation, total sulfur dioxide is approximately normally distributed. The mean total sulfur dioxide is 46.47 mg/dm3.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density is normally distributed. Mean density is 0.997 g/cm3.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is approximately normally distributed, with a mean of 3.31.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates are approximately normally distributed after a log transformation. The middle half of the wines have sulphates in the range 0.55 g/dm3 to 0.73 g/dm3.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations with 13 variables.

What is/are the main feature(s) of interest in your dataset?

Main feature of interest is wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think acids, sugars and alcohol will mostly drive the quality of the wine. We will explore all features to find out exactly what affects wine quality.

Did you create any new variables from existing variables in the dataset?

Yes. A new variable named ‘quality.score’ was created, which categorises the numerical wine quality as ‘Poor’, ‘Average’, ‘Good’ or ‘Excellent’.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, citric acid has an unusual (bimodal) distribution. The structure of the data was not changed.

Bivariate Plots Section

Based on the plot matrix, the feature most strongly correlated with wine quality is alcohol, followed by volatile acidity, sulphates and citric acid.

Below is the correlation value of each feature with quality.

Alcohol : 0.48

Volatile Acidity : -0.39

Sulphates : 0.25

Citric Acid : 0.23
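A list like the one above can be produced in one step by correlating every feature column with quality. The sketch below uses only the first ten rows printed in the `str()` output, so its numbers will not match the full-dataset values listed above; `wine_head` is a hypothetical name for that small sample.

```r
# First ten observations, copied from the str() output at the top of the report.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each feature with quality.
features <- setdiff(names(wine_head), "quality")
cors <- sapply(wine_head[features], cor, y = wine_head$quality)

# Rank features by the absolute strength of the correlation.
cors[order(abs(cors), decreasing = TRUE)]
```

Applied to the full `winedata` data frame, the same pattern would rank all numeric features at once rather than reading them off the plot matrix.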

Alcohol Correlation with Quality

## 
##  Pearson's product-moment correlation
## 
## data:  winedata$quality and winedata$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

We can see that alcohol has a positive linear correlation with wine quality. Wines with a high alcohol content are few in number. Perhaps these wines belong to the Excellent quality category. Let's find out more about this later.
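The t statistic in the test output above follows directly from the sample correlation and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2). As a quick consistency check, the values reported for alcohol can be plugged in:

```r
# Values copied from the cor.test() output for quality and alcohol.
r  <- 0.4761663
df <- 1597

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)
```

The computed `t_stat` agrees with the reported t = 21.639 to the printed precision, and the p-value is far below the reported bound of 2.2e-16.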

Volatile Acidity Correlation with Quality

## 
##  Pearson's product-moment correlation
## 
## data:  winedata$quality and winedata$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

From the plot above, we can see that high-quality wines have low volatile acidity (acetic acid). There is a clear negative linear relationship between volatile acidity and wine quality.

Sulphates Correlation with Quality

## 
##  Pearson's product-moment correlation
## 
## data:  winedata$quality and winedata$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

With some exceptions, wine quality increases as potassium sulphate increases. The main exception is that some lower-quality wines (scores 4-5) have either high or low sulphates, but their quality does not change.

Citric Acid Correlation with Quality

## 
##  Pearson's product-moment correlation
## 
## data:  winedata$quality and winedata$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

Citric acid has a small correlation with wine quality.

Correlations between features other than main features

Alcohol content varies with density. Low density wine has high alcohol content.

There is a strong negative correlation between pH and fixed acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol content has a positive linear relationship with wine quality. Volatile acidity has a negative correlation with wine quality. Most of the good and excellent quality wines have low volatile acidity.

Alcohol and volatile acidity are the main features affecting wine quality. Sulphates and citric acid also affect wine quality, but to a lesser degree.

Alcohol content varies with density: low-density wines have a high alcohol content. Volatile acidity has a negative correlation with citric acid, and citric acid has a positive correlation with fixed acidity and a negative correlation with pH.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. Total sulfur dioxide has a positive correlation with free sulfur dioxide. pH has a negative correlation with fixed acidity. Density has a positive correlation with fixed acidity and residual sugar.

What was the strongest relationship you found?

There is a strong negative correlation between pH and fixed acidity. However, neither pH nor fixed acidity correlates well with wine quality.

Citric acid and fixed acidity are strongly correlated. Fixed acidity also has a strong correlation with density.

Multivariate Plots Section

Most of the good and excellent wines have low volatile acidity and a high percentage of alcohol.

There is an interesting relationship between sulphates and alcohol. Most of the excellent and good quality wines have a high percentage of alcohol and a high level of sulphates.

A large number of good and excellent wines have a high percentage of alcohol and a high level of citric acid.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the multivariate plots it is clear that, after alcohol, volatile acidity, sulphates and citric acid also drive wine quality.

A high alcohol content combined with low volatile acidity is associated with higher wine quality. A high alcohol content combined with a high level of sulphates is also associated with higher wine quality. In addition, a large number of good and excellent wines have a high percentage of alcohol and a high level of citric acid.
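Patterns like these can also be checked numerically by averaging features within each quality category. The sketch below uses a small hypothetical data frame with illustrative values, not rows from the dataset; with the real data, `wines` would be replaced by `winedata`.

```r
# Hypothetical rows; the values are illustrative and not taken from the dataset.
wines <- data.frame(
  quality.score    = factor(c("Average", "Average", "Good", "Good", "Excellent", "Excellent"),
                            levels = c("Poor", "Average", "Good", "Excellent"),
                            ordered = TRUE),
  alcohol          = c(9.4, 9.8, 10.5, 10.9, 11.8, 12.4),
  volatile.acidity = c(0.70, 0.66, 0.55, 0.49, 0.38, 0.32)
)

# Mean alcohol and volatile acidity for each quality category that has rows.
means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.score,
                   data = wines, FUN = mean)
means
```

In this made-up sample the mean alcohol rises and the mean volatile acidity falls from Average to Excellent, which is the kind of trend the multivariate plots are describing.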

Were there any interesting or surprising interactions between features?

Yes, there is an interesting relationship between sulphates and alcohol. Most of the excellent and good quality wines have a high percentage of alcohol and a high level of sulphates.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The alcohol content of a wine is one of the main factors that influence its quality. The mean alcohol content is 10.42%.

Plot Two

Description Two

Most of the good and excellent quality wines have a high percentage of alcohol and low volatile acidity. The quality of the wine depends heavily on these two attributes.

Plot Three

Description Three

Most of the excellent and good wines have a higher percentage of alcohol together with a higher level of sulphates.

Reflection

The wine data has 1599 samples and 13 features. Our goal was to find out which attributes influence the quality of the wine.

In the sample dataset, the highest wine quality score was 8, and few wines had high quality scores. The combined number of wines with a quality score of 7 or 8 was only 217 out of 1599 samples. With such a small number of high-quality wines, it was somewhat difficult to find the attributes that influence quality.

So I decided to categorise the wines based on their score. I created four wine categories: “Excellent”, “Good”, “Average” and “Poor”. Quality scores of 7 and 8 fall under “Excellent”, a score of 6 under “Good”, 5 under “Average” and scores below 5 under “Poor”. This made a large difference, as 855 wines fall under “Good” and “Excellent” combined.
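The counts quoted here can be verified from the category table shown earlier in the report. A short sketch of the arithmetic:

```r
# Category counts copied from the quality.score table in the report.
counts <- c(Poor = 63, Average = 681, Good = 638, Excellent = 217)

# Total number of wines; should equal the 1599 observations.
total <- sum(counts)

# Wines rated Good or Excellent, and their share of the whole sample.
good_or_better <- unname(counts["Good"] + counts["Excellent"])
share_good_or_better <- good_or_better / total

# Share of wines rated Excellent alone (scores 7 and 8).
share_excellent <- unname(counts["Excellent"]) / total
```

This gives 855 wines rated Good or Excellent, a little over half of the 1599 samples, while the Excellent category alone accounts for roughly one wine in seven.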

The distributions of alcohol, fixed acidity, volatile acidity and free sulfur dioxide are skewed right. Citric acid has a bimodal distribution; with a log10 transformation, its distribution is negatively skewed. Density and pH are approximately normally distributed. After a log transformation, residual sugar, chlorides, total sulfur dioxide and sulphates are approximately normally distributed.

From the plot matrix, I found that alcohol, volatile acidity, sulphates and citric acid are the main attributes that influence wine quality.

From the bivariate plots, I found that most of the good and excellent quality wines have a higher percentage of alcohol. Volatile acidity has a negative correlation with wine quality: as wine quality increases, volatile acidity decreases. Wine quality also has a small positive correlation with sulphates and citric acid.

From the multivariate analysis, I found that most of the good and excellent wines have a higher percentage of alcohol, lower volatile acidity and a higher level of sulphates.

For future analysis, it would be interesting to have “viticulture” and “vinification” data in the dataset. I read on a site that these factors strongly affect the quality of the wine. They would be hard to measure, but having that data would make the analysis more interesting.